An Animated Guide©: Speed Merges: SAS V9.1 Hashing
Author
Abstract
Hashing is one of the fastest table lookup techniques, not just in SAS®, but in any programming language. Figure 1 illustrates the concept of a table lookup and the speed advantage of SAS V9 hashing over a format table lookup. If a programmer needs to select, from a large file, all subjects that are in a small file, hashing will likely save disk space and time. Hashing should be part of the tool kit of every programmer who deals with large files.

Hashing is designed to allow a programmer to subset a large file based on information in a small file (as shown in Figure 1). While that is the reason SAS added this new feature, hashing can also replace BY merges, format lookups and IORC merges. A major benefit of hashing is that it can perform a table lookup using two unsorted datasets. Sorting is a disk- and CPU-intensive process. The techniques described in the series of papers named the Animated Guide©: Speed Merges have, as a common theme, the desire to not sort the data. This paper, on SAS V9.1 hashing, is the fourth in the speed merging series and recommends that V9.1 hashing be considered for general use because of its speed and ease of use.

Hashing has been the subject of much mathematical and programming research. It has been proven, under certain common conditions, to be the theoretically fastest method of accessing data. Hashing gets its speed by: 1) being a memory-resident technique, 2) having a conceptually efficient methodology and 3) being very efficient programmatically. Being memory resident avoids slow disk access. The logic of the hashing algorithm gives it advantages over other lookup techniques.

SAS V9 hashing is implemented via two C language objects that are accessed through the regular SAS data step: 1) Hash (which takes care of storage and management of information in the object) and 2) Iter (which takes care of movement up and down the object).

Code-your-own hashing was introduced to SAS by Dr. Paul Dorfmann.
His manually coded techniques run in any version of SAS and are very fast. However, coding his techniques, despite his excellent examples in SAS proceedings, has been considered difficult, and customizing them has rarely been attempted. V9.1 hashing, as implemented by SAS, is considered easier to use than Dr. Dorfmann's techniques, though Dr. Dorfmann's techniques should be considered when speed is critical.

Figure 1. Why Learn Hashing? Fast Table Lookup!! (Large file has 1,000,000 observations.)

Number of obs    Small file as      Hash time   Hashing time as %      Total format   Format build   Format lookup
in small file    % of large file    (sec.)      of total format time   time (sec.)    time (sec.)    time (sec.)
10,000           1.0%               7           75.27%                 9.3            2              7.3
100,000          10.0%              9           27.69%                 32.5           20.2           12.3
500,000          50.0%              12          3.90%                  307.7          280            27.7

Small File
Key   SV1   SV2   SV3
02    2     2.5   2.7
09    6     7.5   3.1
04    4     4.5   4.7

Large File
Key   varL1   varL2
03    30      31
01    10      11
05    50      51
02    20      21
04    40      41
06    60      61

Logical result of the DATA step (often a BY merge)
02    20      21
04    40      41

And the hash object requires less memory than a format.
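The lookup pictured in Figure 1 can be sketched outside SAS as follows. This is a minimal illustration and not the paper's SAS code: a Python dict stands in for the SAS hash object, and the function name `hash_subset` and the tuple layouts are assumptions made for the example. The data values are the ones shown in the figure.

```python
# Illustrative sketch only: a Python dict plays the role of the SAS V9 hash object.
# The name hash_subset and the tuple layouts are hypothetical, chosen for this example.

# Small file: (Key, SV1, SV2, SV3), deliberately unsorted
small_file = [
    ("02", 2, 2.5, 2.7),
    ("09", 6, 7.5, 3.1),
    ("04", 4, 4.5, 4.7),
]

# Large file: (Key, varL1, varL2), deliberately unsorted
large_file = [
    ("03", 30, 31),
    ("01", 10, 11),
    ("05", 50, 51),
    ("02", 20, 21),
    ("04", 40, 41),
    ("06", 60, 61),
]


def hash_subset(small, large):
    """Return the rows of `large` whose key also appears in `small`."""
    # Load the small file into an in-memory hash table once.
    lookup = {row[0]: row[1:] for row in small}
    # Read the large file in a single pass; each key test is a hash probe.
    return [row for row in large if row[0] in lookup]


result = hash_subset(small_file, large_file)
for row in result:
    print(*row)
```

Neither input is sorted at any point. The small file is loaded into memory once and the large file is read in a single pass, which is the property the abstract credits for the speed of hashing.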
Related resources
Table Lookup by Direct Addressing: From V8 to V9
Table lookup is one of the most, if not the most, important and frequently performed data processing operations. The SAS System supports this assertion with a roster of built-in searching techniques, such as merges, joins, formats, indexes, and search-specific operators and functions. It is therefore all the more surprising that until the advent of Version 9 the fastest, i.e. direct-addressing ...
MERGING: Comparing the DATA Step with SQL
Which merges files better: the SAS DATA Step or SAS SQL? Traditionally, the only way to merge files in SAS was via the SAS DATA Step. Now SAS provides a Structured Query Language (SQL) facility which also merges files. This tutorial compares and contrasts these two merge facilities. It examines the pros and cons of each merge technique. It looks at DATA Step code to perform specific merges an...
Table Lookup via Direct Addressing: Key-Indexing, Bitmapping, Hashing
In SAS data processing, searching is one of the most frequent operations. Base SAS offers a rich collection of built-in searching techniques. MERGE, SQL joins, formats, and SAS indexes all serve the purpose of table lookup. For do-it-yourselfers, SAS offers arrays, directly addressable data structures suited for implementing just about any searching algorithm. An array-based lookup is not a r...
Deep Discrete Supervised Hashing
Hashing has been widely used for large-scale search due to its low storage cost and fast query speed. By using supervised information, supervised hashing can significantly outperform unsupervised hashing. Recently, discrete supervised hashing and deep hashing are two representative progresses in supervised hashing. On one hand, hashing is essentially a discrete optimization problem. Hence, util...
Private Detectives In A Data Warehouse: Key-Indexing, Bitmapping, And Hashing
In data processing in general, and numerous aspects of Data Warehousing, in particular, searching is one of the most frequently performed operations. Base SAS offers a rich collection of built-in searching techniques. Merging and SQL joins, formats and SAS indexes all serve the purpose of looking up relevant data. In addition, SAS Language incorporates arrays – the data structures ideal for imp...
Journal title:
Volume, issue:
Pages: -
Publication date: 2004